IQSS/8286 enable i18n search of cvv fields #8435

qqmyers · 2022-02-17T23:10:31Z

What this PR does / why we need it: Before this PR, controlled vocabulary values were indexed only by their cvv names (i.e. as in the tsv) and not by their language-specific translations. This causes #8286: users (in the UI or via the API) don't have a good way to search for these values. (UI users can use the facet, but typing the language-specific term shown in the facet into the search box fails to find the same datasets.) As an incremental fix, this PR leaves the facet field (e.g. subject_ss) as is but indexes all of the configured language variants in the main field (e.g. subject). This means that searches against the field will work (e.g. subject:Chimie, French for Chemistry), and the same search via the API will also work. Configured languages are those configured either as display languages (the ones the overall UI can be shown in) or as languages allowed for Dataset metadata (controlled by the metadata languages setting).
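As a rough illustration of the indexing idea (not the actual Java code in IndexServiceBean), the sketch below collects the translated values for one controlled vocabulary value across the configured languages. The function name, the translation tables, and the fallback to the raw name when no translation is found are all assumptions made for the example.

```python
def index_values(cvv_name, translations, languages):
    """Collect the values to index in the main field for one CVV.

    translations maps a language code to {cvv_name: translated_value}.
    All names here are hypothetical; this is not Dataverse code.
    """
    values = []
    for lang in languages:
        translated = translations.get(lang, {}).get(cvv_name)
        if translated and translated not in values:
            values.append(translated)
    if not values:
        # Assumption of this sketch: fall back to the untranslated name.
        values.append(cvv_name)
    return values


# Hypothetical translation tables for the example.
translations = {
    "en": {"Chemistry": "Chemistry"},
    "fr": {"Chemistry": "Chimie"},
}

# Configured languages: union of display and metadata languages.
display_languages = ["en", "fr"]
metadata_languages = ["fr"]
configured = sorted(set(display_languages) | set(metadata_languages))

solr_doc = {
    # Facet field keeps the untranslated value, so no extra facets appear.
    "subject_ss": ["Chemistry"],
    # Main field carries every configured language variant.
    "subject": index_values("Chemistry", translations, configured),
}
```

With a document shaped like this, a query such as subject:Chimie can match on the main field while the facet list stays unchanged.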

Which issue(s) this PR closes:

Closes #8286

Special notes for your reviewer:
As noted, this is incremental. The limitations are:

Because the facet search is still against the facet field (the _s or _ss postfix), the URL and the facet tags showing which facets are used in the current search still show the untranslated value. It makes sense to keep a facet field with untranslated values (adding translations there would add facets), but because the tags showing which facets are used come directly from the solr fields involved, it is complex to backtrack to the underlying field and then find which translated values exist. (For example, a human knows subject_ss probably relates to the subject field (actually the dvSubject dataverse-level values go in here too) and may then know that this field is in the 'citation' block, but discovering that in code probably means scanning all blocks, keeping a map around, etc.)
The _s or _ss fields are 'string' or 'strings' for solr whereas the main fields are type text_en. The effect is that searches against the main field (e.g. subject) can return partial/close hits, whereas searches against the facet field (subject_ss) either match the full value or miss. It may make sense to set the main field for CVV fields to type string(s) anyway, since one presumably wants exact hits, but I decided against doing that in this PR a) to limit scope and b) hoping that the dynamic schema.xml generation code that isn't yet merged would simplify making such a change (and handling new CVV fields in custom blocks, etc.). In the meantime, since most facet CVV values are unlikely to overlap much, the practical effect is probably small. (I suppose a search for 'and' could pull up several two-part facets, etc.)
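The difference between the two field types can be pictured with a toy model. This is only a sketch: it treats a 'string' field as whole-value equality and a text field as lowercase word tokens, with a value matching if it shares any token with the query. Real text_en analysis in solr does more than this (stop words and stemming, for instance), and the helper names are made up for the example.

```python
import re


def tokens(text):
    # Toy analyzer: lowercase word tokens only.
    return re.findall(r"\w+", text.lower())


def matches_string(field_values, query):
    # 'string' field: the whole stored value must equal the query.
    return [v for v in field_values if v == query]


def matches_text(field_values, query):
    # Toy text field: a value matches if it shares any token with the query.
    wanted = set(tokens(query))
    return [v for v in field_values if wanted & set(tokens(v))]


subjects = [
    "Social Sciences",
    "Earth and Environmental Sciences",
    "Chemistry",
]

exact_hits = matches_string(subjects, "Sciences")
partial_hits = matches_text(subjects, "sciences")
```

In this model the string lookup for "Sciences" finds nothing, while the tokenized lookup returns both multi-word subjects, which is the kind of partial hit described above.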

Suggestions on how to test this: Configure multiple display and/or metadata languages and verify that searches using the basic search box work with the translated values, and that facet search also works as before. One could also test the API call with the same queries. Sciences PO might be able to assist in setup and/or testing.
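For the API check, something like the following could build the two queries to compare. It only constructs URLs and makes no request. The host name is a placeholder, and the use of the Search API's q and fq parameters with these particular field names is an assumption based on the examples in this description.

```python
from urllib.parse import urlencode

# Placeholder host; substitute the installation under test.
BASE_URL = "https://dataverse.example.edu"


def search_url(params):
    # Build a Search API URL from a dict of query parameters.
    return f"{BASE_URL}/api/search?{urlencode(params)}"


# Search on the main field with a translated term.
translated_url = search_url({"q": "subject:Chimie"})

# Filter on the facet field with the untranslated value.
facet_url = search_url({"q": "*", "fq": 'subject_ss:"Chemistry"'})
```

If the translations are indexed as intended, both URLs should return the same datasets on an installation with French configured.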

Does this PR introduce a user interface change? If mockups are available, please link/include them here: only in terms of allowing user-entered translated terms to produce search results.

Is there a release notes update needed for this change?:

Additional documentation:

Conflicts: src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java

landreev · 2022-03-14T22:06:10Z

Since the PR appears to have some overlapping parts with #8437, is there anything special about how the 2 need to be handled? It looks like it should be safe to merge them in any order... but is it?

landreev

I had to re-read the part about "strings" (_s and _ss) vs. text_en a couple of times... But I think it all makes sense.

qqmyers · 2022-03-15T01:46:27Z

I think all the PRs are independent w.r.t. merging. It has certainly been convenient to test them all together, but there shouldn't be any code dependencies.

qqmyers added 2 commits February 17, 2022 17:39

add i18n indexing of CVV

207fd1e

Conflicts: src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java

i18n indexing for subject field/dataset cvv

b93913b

landreev self-assigned this Mar 14, 2022

landreev approved these changes Mar 14, 2022

landreev removed their assignment Mar 15, 2022

kcondon self-assigned this Mar 21, 2022

Merge remote-tracking branch 'IQSS/develop' into IQSS/8286-enable_i18…

a59be6d

…n_search_of_cvv_fields

kcondon merged commit c5d1df2 into IQSS:develop Mar 23, 2022

pdurbin added this to the 5.11 milestone Mar 29, 2022

qqmyers mentioned this pull request Apr 12, 2022

IQSS/8595 - update existing schema and code to make all cvv files multival in solr #8601

Merged

ErykKul mentioned this pull request Apr 20, 2022

Controlled vocabulary fields are not indexed under search name if languages are not configured in settings #8623

Closed

pdurbin mentioned this pull request Apr 21, 2022

Search: Newly created datasets or newly added cvv fields are not searchable #8633

Closed

ErykKul mentioned this pull request Apr 22, 2022

8623 solr index vocab #8624

Merged
